(57) Abstract

A sequential transducer, derived from a Hidden Markov Model, that closely approximates the behavior of the stochastic model. The
invention provides (a) a method (called n-type approximation) of deriving a simple finite-state transducer which is applicable in all cases,
from HMM probability matrices, (b) a method (called s-type approximation) for building a precise HMM transducer for selected cases
which are taken from a training corpus, (c) a method for completing the precise (s-type) transducer with sequences from the simple (n-type)
transducer, which makes the precise transducer applicable in all cases, and (d) a method (called b-type approximation) for building an
HMM transducer with variable precision which is applicable in all cases. This transformation is especially advantageous for part-of-speech
tagging because the resulting transducer can be composed with other transducers that encode correction rules for the most frequent tagging
errors. The speed of tagging is also improved. The described methods have been implemented and successfully tested on six languages.

FSTs approximating Hidden Markov Models and text tagging using same

The present invention relates to computer-based text processing, and more particularly to
techniques for part-of-speech tagging by a finite state transducer (FST) derived from a Hidden
Markov Model (HMM).

Part-of-speech tagging of machine-readable text, whereby tags are applied to words in a
sentence so as to identify a word's part-of-speech (e.g. noun-singular, verb-present-1st person
plural), is known. Such tags are typically of a standardised form, such as specified under the
Text Encoding Initiative (TEI). A text or corpus, once tagged, finds use, for example, in
information retrieval from the text and statistical analysis of the text.

Probabilistic HMM based taggers are known from Bahl and Mercer (1976) and Church 
(1988); see Section H: References at the end of this disclosure. They select among the part-of- 
speech tags that are found for a word in the dictionary, the most probable tag based on the word's 
context, i.e. based on the adjacent words. 
However, a problem with a conventional HMM based tagger is that it cannot be integrated
with other finite state tools which perform other steps of text analysis/manipulation.

One method is to generate an FST performing part-of-speech tagging from a set of rules
written by linguists (Chanod and Tapanainen, 1995). This, however, takes considerable time and
produces very large automata (or sets of automata that cannot be integrated with each other for
reasons of size) which perform the tagging relatively slowly. Such an FST may also be
non-deterministic, i.e. for an input sentence it may provide more than one solution or no solutions at
all.

The present invention provides methods to approximate a Hidden Markov Model (HMM)
used for part-of-speech tagging, by a finite-state transducer. (An identical model of an HMM by a
transducer (without weights) is in many cases impossible.) In specific embodiments there are
provided (a) a method (called n-type approximation) of deriving a simple finite-state transducer
which is applicable in all cases, from HMM probability matrices, (b) a method (called s-type
approximation) for building a precise HMM transducer for selected cases which are taken from a
training corpus, (c) a method for completing the precise (s-type) transducer with sequences from
the simple (n-type) transducer, which makes the precise transducer applicable in all cases, and
(d) a method (called b-type approximation) for building an HMM transducer with variable precision
which is applicable in all cases.

The invention provides a method of generating a text tagging FST according to any of
claims 1, 5 and 9 of the appended claims, or according to any of the particular embodiments
described herein.

The invention further provides a method of generating a composite finite state transducer
by composing the HMM-derived FST with one or more further text-manipulating FSTs.

The invention further provides a method of tagging a text using the aforementioned
HMM-derived FST or the composite FST.



SUBSTITUTE SHEET (RULE 26) 




WO 99/01828 PCT/EP98/04153

The invention further provides a text processing system according to claim 11 of the
appended claims, and a recordable medium according to claim 12 of the appended claims.

The HMM transducer builds on the data (probability matrices) of the underlying HMM. The
accuracy of the data collected in the HMM training process has an impact on the tagging accuracy
of both the HMM itself and the derived transducer. The training of this HMM can be done on either
a tagged or untagged corpus, and is not discussed in detail herein since it is exhaustively
described in the literature (Bahl and Mercer, 1976; Church, 1988).

An advantage of finite-state taggers according to the invention is that tagging speed
when using transducers is up to five times higher than when using the underlying HMM (cf.
section F hereinbelow).

However, a main advantage of the invention is that integration with tools performing
further steps of text analysis is possible: transforming an HMM into an FST means that this
transducer can be handled by the finite state calculus and therefore directly integrated into other
finite-state text processing tools, such as those available from XEROX Corp., and elsewhere.
Since the tagger is in the middle of a chain of text analysis tools where all the other tools may be
finite-state-based (which is the case with text processing tools available from XEROX Corp.),
converting the HMM into an FST makes this chain homogeneous, and thus enables merging the
chain's components into one single FST, by means of composition.

In particular, it is possible to compose the HMM-derived transducer with, among others,
one or more of the following transducers that encode:

• correction rules for the most frequent tagging errors in order to significantly improve tagging
accuracy. These rules can either be extracted automatically from a corpus (Brill, 1992) or
written manually (Chanod and Tapanainen, 1995). The rules may include long-distance
dependencies which are usually not handled by HMM taggers.

• further steps of text analysis, e.g. light parsing or extraction of noun phrases or other phrases
(Ait-Mokhtar and Chanod, 1997).

• criteria which decide on the relevance of a corpus with respect to a particular query in
information retrieval (e.g. occurrence of particular words in particular syntactic structures).

These transducers can be composed separately or all at once (Kaplan and Kay, 1994). Such
composition enables complex text analysis to be performed by a single transducer; see
EP-A-583,083.
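
The effect of composing the HMM-derived transducer with a correction transducer can be pictured with ordinary function composition. The following is only a minimal sketch in Python, not the finite-state calculus referred to above: `compose`, `hmm_tagger` and `fix_det_verb` are hypothetical names, the tag table is a toy assumption, and the correction rule is an invented illustration. A real composition yields one single transducer, whereas this sketch merely chains two stages.

```python
from functools import reduce

def compose(*stages):
    """Chain stages left to right, in the spirit of composing transducers."""
    return lambda x: reduce(lambda acc, stage: stage(acc), stages, x)

def hmm_tagger(classes):
    # Stand-in for an HMM-derived transducer: a fixed toy mapping (assumption).
    table = {"[DET]": "DET", "[NOUN,VERB]": "VERB", "[NOUN]": "NOUN"}
    return [table[c] for c in classes]

def fix_det_verb(tags):
    # Hypothetical correction rule: a VERB directly after a DET becomes a NOUN.
    out = list(tags)
    for i in range(1, len(out)):
        if out[i - 1] == "DET" and out[i] == "VERB":
            out[i] = "NOUN"
    return out

tagger = compose(hmm_tagger, fix_det_verb)
result = tagger(["[DET]", "[NOUN,VERB]"])
print(result)
```

The caller sees a single `tagger` object, which is the property that composition of finite-state transducers provides for the whole analysis chain.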

It will be appreciated that the algorithms described herein may find uses beyond those
described with respect to the particular embodiments below, i.e. in any kind of analysis
of written or spoken language based on both finite-state technology and HMMs, such as corpus
analysis, speech recognition, etc. The algorithms have been fully implemented.

Embodiments of the invention will now be described, by way of example, with reference to 
the accompanying drawings, in which: 

Figure 1 illustrates decisions on tags with an n-type transducer, according to one
embodiment of the invention;




Figure 2 illustrates schematically the generation of an n-type transducer in accordance
with an embodiment of the invention;

Figure 3 illustrates the procedure of building an n-type transducer in accordance with an
embodiment of the invention;

Figure 4 is a diagram illustrating class subsequences of a sentence;

Figure 5 is a schematic diagram of the steps in the procedure of building an s-type
transducer, in accordance with an alternative embodiment of the invention;

Figure 6 is a diagram illustrating the disambiguation of classes between two selected
tags;

Figure 7 is a diagram illustrating valid paths through the tag space of a sentence;

Figure 8 is a diagram illustrating b-type sequences;

Figure 9 is a schematic flow chart illustrating the procedure of building a b-type
transducer;

Figure 10 is an illustration of the tagging of a sentence using either the n-type transducer
formed in Fig. 3 or the s-type transducer formed in Fig. 5 or the b-type transducer formed in Fig.
9; and

Figure 11 is a schematic flow chart of the steps involved in the procedure, in accordance
with an embodiment of the invention, of tagging a text corpus with a finite state tagger using an
HMM transducer.


A. System configuration 

It will be appreciated that the techniques according to the invention may be employed
using conventional computer technology. The invention may be
implemented using a PC running Windows™, a Mac running MacOS, or a minicomputer running
UNIX, which are well known in the art. For example, the PC hardware configuration is discussed
in detail in The Art of Electronics, 2nd Edn, Ch. 10, P. Horowitz and W. Hill, Cambridge University
Press, 1989. The invention has been implemented in C on a Sun Sparc 20 workstation running
UNIX.



B. FST derivation

The invention will be described by reference to three methods of deriving a transducer
for part-of-speech tagging from an HMM. These methods and transducers are referred to herein
as n-type, s-type and b-type.

An HMM used for tagging encodes, like a transducer, a relation between two languages.
One language contains sequences of ambiguity classes obtained by looking up in a lexicon all the
words of a sentence. The other language contains sequences of tags obtained by statistically
disambiguating the class sequences. From outside, an HMM tagger behaves like a sequential
transducer that deterministically maps every sequence of ambiguity classes (corresponding to a
sentence) into a unique sequence of tags, e.g.:

    [DET]  [ADJ,NOUN]  [ADJ,NOUN]  [END]
    DET    ADJ         NOUN        END

C. n-Type Transducers 

This section presents a method that approximates a first order Hidden Markov Model 
(HMM) by a finite state transducer (FST), referred to as an n-type approximation. Figure 1 
illustrates decisions on tags with an n-type transducer, according to one embodiment of the 
invention. 

As in a first order HMM we take into account initial probabilities $\pi$, transition probabilities $a$
and class (i.e. observation symbol) probabilities $b$. However, probabilities over paths are not
estimated. Unlike in an HMM, once a decision is made, it influences the following decisions but is
itself irreversible.

Figure 1 illustrates this behaviour with an example: for the class $c_1$ of word $w_1$, tag $t_{12}$ is
selected, which is the most likely tag in an initial position. For word $w_2$, the tag $t_{22}$ is selected,
which is the most likely tag given class $c_2$ and the previously selected tag $t_{12}$, etc.

A transducer encoding this behaviour can be generated as illustrated in Fig. 2; this shows
schematically an n-type transducer in accordance with an embodiment of the invention. In this
example there is a set of three classes, $c_1$ with the two tags $t_{11}$ and $t_{12}$, $c_2$ with the three tags
$t_{21}$, $t_{22}$ and $t_{23}$, and $c_3$ with one tag $t_{31}$. Different classes may contain the same tag, e.g. $t_{12}$
and $t_{23}$ may refer to the same tag (e.g. [NOUN]). Figure 3 illustrates the procedure of building an
n-type transducer in accordance with an embodiment of the invention.

Starting with the set of tags and the set of classes (generally designated 2), for every
possible pair of a class and a tag (e.g. $c_1{:}t_{12}$ or [ADJ,NOUN]:NOUN) a unique state is created
(step s1) and labelled with this same pair. This set of states allows any class to be mapped to any one
of its tags. An initial state, which does not correspond to any pair, is also created in step s1. All
states are final, marked by double circles. This produces a set of states labelled with class-tag
pairs and one initial state (generally designated 4).

For every state, as many outgoing arcs are created (step s2) as there are classes (three
in Fig. 2). Each such arc for a particular class points to the most probable pair of this same class.
The set of outgoing arcs of one state allows the following tag to be decided on, based on the
following class and the current state's tag. If the arc comes from the initial state, the most
probable pair of a class and a tag (destination state) is estimated by the initial and class
probability of the tag:

$$\arg\max_{k}\; p_1(c_i, t_{ik}) = \pi(t_{ik}) \cdot b(c_i \mid t_{ik})$$

If the arc comes from a state other than the initial state, the most probable pair is estimated by the
transition and class probability of the tag:

$$\arg\max_{k}\; p_2(c_i, t_{ik}) = a(t_{ik} \mid t_{\mathrm{previous}}) \cdot b(c_i \mid t_{ik})$$

In the example (Fig. 2), $c_1{:}t_{12}$ is the most likely pair of class $c_1$, and $c_2{:}t_{23}$ the most
likely pair of class $c_2$ when coming from the initial state, and $c_2{:}t_{21}$ the most likely pair of class $c_2$
when coming from the state of $c_3{:}t_{31}$.
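
The choice of destination for each arc can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `CLASSES`, `PI`, `A`, `B` and `build_n1_arcs` are hypothetical, and every probability is an invented toy value. States are keyed here by the previously selected tag only (with `None` for the initial state), since the estimate for an outgoing arc depends only on the tag of the source state.

```python
# Toy first-order HMM; every number below is an invented assumption.
CLASSES = {"[DET]": ["DET"], "[ADJ,NOUN]": ["ADJ", "NOUN"], "[NOUN]": ["NOUN"]}
PI = {"DET": 0.5, "ADJ": 0.2, "NOUN": 0.3}            # initial probabilities pi(t)
A = {("DET", "DET"): 0.0, ("DET", "ADJ"): 0.4, ("DET", "NOUN"): 0.6,
     ("ADJ", "DET"): 0.0, ("ADJ", "ADJ"): 0.2, ("ADJ", "NOUN"): 0.8,
     ("NOUN", "DET"): 0.3, ("NOUN", "ADJ"): 0.2, ("NOUN", "NOUN"): 0.5}
# A is keyed (previous tag, next tag) and holds a(next | previous).
B = {("[DET]", "DET"): 1.0, ("[ADJ,NOUN]", "ADJ"): 0.9,
     ("[ADJ,NOUN]", "NOUN"): 0.3, ("[NOUN]", "NOUN"): 0.7}  # b(class | tag)

def build_n1_arcs(classes, pi, a, b):
    """Map (previous tag, or None for the initial state; class) -> chosen tag."""
    tags = sorted(pi)
    arcs = {}
    for c, c_tags in classes.items():
        # Arc leaving the initial state: maximise pi(t) * b(c|t).
        arcs[(None, c)] = max(c_tags, key=lambda t: pi[t] * b[(c, t)])
        # Arc leaving any other state: maximise a(t|prev) * b(c|t).
        for prev in tags:
            arcs[(prev, c)] = max(c_tags, key=lambda t: a[(prev, t)] * b[(c, t)])
    return arcs

ARCS = build_n1_arcs(CLASSES, PI, A, B)
print(ARCS[("ADJ", "[ADJ,NOUN]")])   # NOUN, since 0.8 * 0.3 > 0.2 * 0.9
```

With these toy values the class [ADJ,NOUN] is resolved to ADJ after DET and to NOUN after ADJ, so the same class leads to different destination states depending on the source state.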

Every arc is labelled with the same symbol pair as its destination state, with the class
symbol in the upper language and the tag symbol in the lower language. E.g. every arc leading to
the state of $c_1{:}t_{12}$ is labelled with $c_1{:}t_{12}$. The result is the non-minimal and non-deterministic FST
6.

Finally, all state labels can be deleted since the behaviour described above is encoded in
the arc labels and the network structure. The network can be minimised and determinised (step
s3) to produce the n-type FST 8.

The above mathematical model is referred to as an n1-type model, the resulting
finite-state transducer an n1-type transducer and the whole algorithm leading from the HMM to this
transducer, an n1-type approximation of a first order HMM.

Adapted to a second order HMM, this algorithm would give an n2-type approximation.

Every n-type transducer is sequential, i.e. deterministic on the input side. In every state
there is exactly one outgoing arc for every class symbol. An n-type transducer tags any corpus
deterministically.
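
Because exactly one arc is followed per input symbol, applying an n-type transducer amounts to a single left-to-right pass. The sketch below illustrates this in Python under stated assumptions: `tag_with_n_type` and `TOY_ARCS` are hypothetical names, and the arc table is an invented fragment that lists only the arcs needed for the example (a complete transducer would hold one arc per state and class).

```python
def tag_with_n_type(arcs, class_sequence):
    """Follow one arc per input class; earlier choices are never revised."""
    state = None                      # None stands for the initial state
    tags = []
    for c in class_sequence:
        state = arcs[(state, c)]      # the destination is identified by its tag
        tags.append(state)
    return tags

# Hypothetical fragment of an arc table, keyed (previous tag, class).
TOY_ARCS = {
    (None, "[DET]"): "DET",
    ("DET", "[ADJ,NOUN]"): "ADJ",
    ("ADJ", "[ADJ,NOUN]"): "NOUN",
    ("NOUN", "[END]"): "END",
}

tags = tag_with_n_type(TOY_ARCS, ["[DET]", "[ADJ,NOUN]", "[ADJ,NOUN]", "[END]"])
print(tags)   # ['DET', 'ADJ', 'NOUN', 'END']
```

The running time is linear in the sentence length and no path probabilities are stored, which is the sense in which decisions are irreversible.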

D. s-Type Transducers

This section presents a method, according to another embodiment of the invention, that
approximates a first order Hidden Markov Model (HMM) by a finite state transducer (FST),
referred to herein as an s-type approximation.

D.1 Mathematical Background

To tag a sentence, i.e. to map its class sequence to the most probable tag sequence, one
can split the class sequence at the unambiguous classes (containing one tag only) into
subsequences, then tag them separately and concatenate them again. The result is equivalent to
the one obtained by tagging the sentence as a whole.

Figure 4 is a diagram illustrating class subsequences of a sentence. Two types of
subsequences of classes are distinguished: initial and middle ones. The final subsequence of a
sentence is equivalent to a middle one, if it is assumed that the sentence end symbol (. or ! or ?)
always corresponds to an unambiguous class $c_u$.

An initial subsequence $C_i$ starts with the sentence initial position, has any number
(incl. zero) of ambiguous classes $c_a$ and ends with the first unambiguous class $c_u$ of the sentence. It
can be described by the regular expression:

$$C_i = c_a^{*}\; c_u$$

Given an initial class subsequence $C_i$ of length $r$, its joint probability together with a
corresponding initial tag subsequence $T_i$ can be estimated by:

$$p(C_i, T_i) = \pi(t_1) \cdot b(c_1 \mid t_1) \cdot \left[ \prod_{j=2}^{r-1} a(t_j \mid t_{j-1}) \cdot b(c_j \mid t_j) \right]$$

A middle subsequence $C_m$ starts immediately after an unambiguous class $c_u$, has any
number (incl. zero) of ambiguous classes $c_a$ and ends with the following unambiguous class $c_u$. It
can be described by the regular expression:

$$C_m = c_a^{*}\; c_u$$

To estimate the probability of middle tag sequences $T_m$ correctly, we have to include the
immediately preceding unambiguous class $c_u$, which actually belongs to the preceding subsequence
$C_i$ or $C_m$. Thus we obtain an extended middle subsequence:

$$C_m^e = c_u^e\; c_a^{*}\; c_u$$

The joint probability of an extended middle class subsequence $C_m^e$ of length $s$ together
with a corresponding tag subsequence $T_m^e$ can be estimated by:

$$p(C_m^e, T_m^e) = b(c_1 \mid t_1) \cdot \left[ \prod_{j=2}^{s-1} a(t_j \mid t_{j-1}) \cdot b(c_j \mid t_j) \right]$$

D.2 Building an s-type transducer

To build an s-type transducer, we generate a large number of initial class sequences $C_i$
and extended middle class sequences $C_m^e$ as described in D.3 below, and disambiguate (i.e.
tag) each of them based on a first order HMM using the Viterbi algorithm for efficiency (Viterbi,
1967; Rabiner, 1990).
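
The disambiguation of one class subsequence with the Viterbi algorithm can be sketched as follows. This is a minimal Python illustration under stated assumptions: the names `viterbi`, `CLASSES`, `PI`, `A` and `B` are hypothetical, all probabilities are invented toy values, and the `initial` flag (an assumption of this sketch) switches between using the initial probability for an initial subsequence and omitting it for an extended middle subsequence, whose first class is unambiguous.

```python
# Toy first-order HMM; every number below is an invented assumption.
CLASSES = {"[DET]": ["DET"], "[ADJ,NOUN]": ["ADJ", "NOUN"], "[NOUN]": ["NOUN"]}
PI = {"DET": 0.5, "ADJ": 0.2, "NOUN": 0.3}
A = {("DET", "DET"): 0.0, ("DET", "ADJ"): 0.4, ("DET", "NOUN"): 0.6,
     ("ADJ", "DET"): 0.0, ("ADJ", "ADJ"): 0.2, ("ADJ", "NOUN"): 0.8,
     ("NOUN", "DET"): 0.3, ("NOUN", "ADJ"): 0.2, ("NOUN", "NOUN"): 0.5}
B = {("[DET]", "DET"): 1.0, ("[ADJ,NOUN]", "ADJ"): 0.9,
     ("[ADJ,NOUN]", "NOUN"): 0.3, ("[NOUN]", "NOUN"): 0.7}

def viterbi(class_seq, classes, pi, a, b, initial=True):
    """Most probable tag sequence for a class sequence (first-order HMM)."""
    first = class_seq[0]
    # score[t]: probability of the best path ending in tag t at this position
    score = {t: (pi[t] if initial else 1.0) * b[(first, t)]
             for t in classes[first]}
    back = []                                   # one back-pointer table per step
    for c in class_seq[1:]:
        new_score, pointers = {}, {}
        for t in classes[c]:
            prev = max(score, key=lambda p: score[p] * a[(p, t)])
            new_score[t] = score[prev] * a[(prev, t)] * b[(c, t)]
            pointers[t] = prev
        score = new_score
        back.append(pointers)
    tag = max(score, key=score.get)             # best final tag
    path = [tag]
    for pointers in reversed(back):             # trace the best path backwards
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

seq = ["[DET]", "[ADJ,NOUN]", "[ADJ,NOUN]", "[NOUN]"]
print(viterbi(seq, CLASSES, PI, A, B))          # ['DET', 'ADJ', 'ADJ', 'NOUN']
```

Unlike a greedy left-to-right choice, the second [ADJ,NOUN] is tagged ADJ here because the path as a whole, including the final NOUN, is more probable under the toy values.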

Every class subsequence gets linked to its most probable tag subsequence by means of
the cross product operation:

$$S_i = C_i \;.x.\; T_i = c_1{:}t_1\;\; c_2{:}t_2\; \ldots\; c_n{:}t_n$$

$$S_m^e = C_m^e \;.x.\; T_m^e = c_1^e{:}t_1^e\;\; c_2{:}t_2\; \ldots\; c_n{:}t_n$$

Then the union ${}^{\cup}S_i$ of all initial subsequences $S_i$ and the union ${}^{\cup}S_m^e$ of all extended middle
subsequences $S_m^e$ is built.

In all extended middle subsequences $S_m^e$, like e.g.

    $C_m^e$ = [DET] [ADJ,NOUN] [ADJ,NOUN] [NOUN]
    $T_m^e$ = DET ADJ ADJ NOUN

the first class symbol on the upper side and the first tag symbol on the lower side will be marked
as an extension that does not really belong to the middle sequence but is necessary to
disambiguate it correctly. The above example becomes

    $C_m^0$ = 0.[DET] [ADJ,NOUN] [ADJ,NOUN] [NOUN]
    $T_m^0$ = 0.DET ADJ ADJ NOUN

Now it is possible to formulate a preliminary sentence model containing one initial subsequence
followed by any number (incl. zero) of extended middle subsequences:

$${}^{\cup}S^0 = {}^{\cup}S_i\;\; {}^{\cup}S_m^{0\,*}$$

where all middle subsequences $S_m^0$ are still marked and extended in the sense that all
occurrences of all unambiguous classes are mentioned twice: once at the end of every sequence
(unmarked) and also at the beginning of every following sequence (marked).
To ensure a correct concatenation of initial and middle subsequences, a concatenation
constraint $R_c$ for classes is formulated, stating that every middle subsequence must begin with
the same marked unambiguous class $c_u^0$ (e.g. 0.[DET]) that occurs unmarked as $c_u$ (e.g. [DET])
at the end of the preceding subsequence, since both symbols refer to the same occurrence of this
unambiguous class.

Having ensured correct concatenation by composition of the preliminary sentence model
${}^{\cup}S^0$ with the concatenation constraint $R_c$, all marked classes on the upper side and all marked
tags on the lower side of the relation are deleted.

The above mathematical model is referred to as an s-type model, the corresponding
finite-state transducer an s-type transducer and the whole algorithm leading from the HMM to the
transducer, an s-type approximation of an HMM.

An s-type transducer tags any corpus that does not contain unknown subsequences
exactly the same way, i.e. with the same errors, as the corresponding HMM tagger does. An
s-type transducer is, however, incomplete because it encodes only a limited set of subsequences of
ambiguity classes. Therefore an s-type transducer cannot tag sentences with one or more
subsequences that are not in this set. An s-type transducer can be completed as described in D.4
below.
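
The behaviour of an s-type model, including its incompleteness, can be imitated with plain lookup tables. The following Python sketch is an illustration under stated assumptions rather than the finite-state construction itself: `tag_s_type`, `INITIAL` and `MIDDLE` are hypothetical names, the tables are invented, and returning `None` for an unknown subsequence is a convention of this sketch.

```python
def tag_s_type(class_seq, classes, initial_table, middle_table):
    """Tag by looking up stored subsequences; None if any one is unknown."""
    # Positions of unambiguous classes (exactly one tag).
    cut = [i for i, c in enumerate(class_seq) if len(classes[c]) == 1]
    if not cut:
        return None                       # sketch assumes at least one such class
    first = tuple(class_seq[:cut[0] + 1])
    if first not in initial_table:
        return None
    tags = list(initial_table[first])
    for i, j in zip(cut, cut[1:]):
        ext = tuple(class_seq[i:j + 1])   # extended middle subsequence
        if ext not in middle_table:
            return None
        # The first tag belongs to the extension, which is already covered
        # by the preceding subsequence, so it is dropped on concatenation.
        tags.extend(middle_table[ext][1:])
    return tags

CLASSES = {"[DET]": ["DET"], "[ADJ,NOUN]": ["ADJ", "NOUN"], "[NOUN]": ["NOUN"]}
INITIAL = {("[DET]",): ("DET",)}
MIDDLE = {("[DET]", "[ADJ,NOUN]", "[ADJ,NOUN]", "[NOUN]"):
          ("DET", "ADJ", "ADJ", "NOUN")}

known = tag_s_type(["[DET]", "[ADJ,NOUN]", "[ADJ,NOUN]", "[NOUN]"],
                   CLASSES, INITIAL, MIDDLE)
unknown = tag_s_type(["[DET]", "[ADJ,NOUN]", "[NOUN]"], CLASSES, INITIAL, MIDDLE)
print(known, unknown)
```

The second call fails because the subsequence [DET] [ADJ,NOUN] [NOUN] was never stored, which mirrors the case of a sentence that an incomplete s-type transducer cannot tag.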



D.3 Generation of ambiguity class subsequences 

This section describes three ways to obtain class subsequences that are needed to build
an s-type transducer. Figure 5 is a schematic diagram of the steps in the procedure of building an
s-type transducer, in accordance with an alternative embodiment of the invention.

(a) Extraction from a corpus 

Based on a lexicon and a guesser, an untagged training corpus (10) is annotated with
class labels, exactly the same way as is done later for the purpose of tagging.

From every sentence we extract (step s4) the initial class subsequence $C_i$ that ends with
the first unambiguous class $c_u$, and all extended middle subsequences $C_m^e$ ranging from any
unambiguous class in the sentence to the following unambiguous class. This generates the
incomplete set of class subsequences (s-type), designated 12.
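
The extraction of step s4 can be sketched as a simple split at the unambiguous classes. This Python fragment is a minimal illustration under stated assumptions: `split_at_unambiguous` and the toy `CLASSES` table are hypothetical, and the sketch assumes, as the text does, that a sentence ends with an unambiguous class. Treating a sequence without any unambiguous class as a single initial subsequence is a choice made only for this sketch.

```python
def split_at_unambiguous(class_seq, classes):
    """Return (initial subsequence, list of extended middle subsequences)."""
    cut = [i for i, c in enumerate(class_seq) if len(classes[c]) == 1]
    if not cut:
        return list(class_seq), []        # nothing to split on (sketch-only choice)
    initial = list(class_seq[:cut[0] + 1])
    # Each extended middle subsequence starts AT an unambiguous class
    # (the extension) and ends at the next unambiguous class.
    middles = [list(class_seq[i:j + 1]) for i, j in zip(cut, cut[1:])]
    return initial, middles

CLASSES = {"[DET]": ["DET"], "[ADJ,NOUN]": ["ADJ", "NOUN"],
           "[NOUN]": ["NOUN"], "[SENT]": ["SENT"]}
sentence = ["[ADJ,NOUN]", "[DET]", "[ADJ,NOUN]", "[ADJ,NOUN]", "[NOUN]", "[SENT]"]
initial, middles = split_at_unambiguous(sentence, CLASSES)
print(initial)
print(middles)
```

For the toy sentence this yields the initial subsequence [ADJ,NOUN] [DET] and two extended middle subsequences, the second of which is simply [NOUN] [SENT].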

(b) Generation of all possible subsequences 

Here, the set of tags and set of classes (generally designated 2) is used as the starting
point. The set of all classes $c$ is split into the subset of unambiguous classes $c_u$ and the subset of
ambiguous classes $c_a$. Then all possible initial and extended middle class subsequences, $C_i$ and
$C_m^e$, up to a defined length are generated (step s5).

As in D.3(a), an incomplete set of class sequences (s-type) 12 is obtained, and the HMM
can be used (step s7) to obtain from this an incomplete set of class and tag sequences (s-type)
14. This set 14 can be used with finite state calculus to build (step s8) an s-type transducer 16;
and this incomplete s-type FST in turn is used to derive the incomplete set 12 through extraction
(step s10) of all class sequences.

(c) Extraction from a transducer

If an n-type transducer N approximating an HMM is already available, initial and extended
middle class and tag subsequences, $S_i$ and $S_m^e$, can be extracted (step s6) using finite state
calculus, to produce a complete set 17 of class and tag sequences (n-type).

D.4 Completion of s-type transducers 

s-Type transducers with class subsequences that have been generated as described in
D.3(a) or D.3(b) above, are in general not complete on the input side. They do not accept all
possible ambiguity class sequences. This, however, is necessary for a transducer used to
disambiguate class sequences of any corpus since a new corpus can contain sequences not
encountered in the training corpus.

An incomplete s-type transducer S can be completed (step s12) with subsequences from
an auxiliary complete n-type transducer N as follows:

First the unions of initial and extended middle subsequences, ${}^{\cup}_{s}S_i$ and ${}^{\cup}_{s}S_m^e$, are
extracted from the primary s-type transducer S, and the unions ${}^{\cup}_{n}S_i$ and ${}^{\cup}_{n}S_m^e$ are extracted
from the auxiliary n-type transducer N, as described in section D.3(c) above.

Then a joint union ${}^{\cup}S_i$ of initial subsequences is made:

$${}^{\cup}S_i = {}^{\cup}_{s}S_i \;\mid\; [\; [\, {}^{\cup}_{n}S_i.u \;-\; {}^{\cup}_{s}S_i.u \,] \;.o.\; {}^{\cup}_{n}S_i \;]$$

and a joint union ${}^{\cup}S_m^e$ of extended middle subsequences:

$${}^{\cup}S_m^e = {}^{\cup}_{s}S_m^e \;\mid\; [\; [\, {}^{\cup}_{n}S_m^e.u \;-\; {}^{\cup}_{s}S_m^e.u \,] \;.o.\; {}^{\cup}_{n}S_m^e \;]$$

In both cases all subsequences from the principal model S are unioned with all those
subsequences from the auxiliary model N that are not in S.

Finally, the complete s+n-type transducer 18 is generated (step s14) from the joint unions
of subsequences ${}^{\cup}S_i$ and ${}^{\cup}S_m^e$, as described in section D.2 above.

A transducer completed in this way disambiguates all subsequences known to the
principal incomplete s-type model exactly as the underlying HMM does, and all other
subsequences as the auxiliary n-type model does.
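
The completion step can be imitated with dictionaries that map class subsequences to tag subsequences. The Python sketch below is an illustration under stated assumptions: `complete`, `S_TABLE` and `N_TABLE` are hypothetical names and the entries are invented. It keeps every pair known to the principal s-type model and adds from the auxiliary n-type model only those pairs whose class subsequence the s-type model lacks.

```python
def complete(s_table, n_table):
    """Union of S with the part of N whose input side is not covered by S."""
    missing_inputs = set(n_table) - set(s_table)     # inputs known to N only
    restricted_n = {k: n_table[k] for k in missing_inputs}
    return {**s_table, **restricted_n}

# Invented toy tables: class subsequence -> tag subsequence.
S_TABLE = {("[DET]", "[ADJ,NOUN]", "[ADJ,NOUN]", "[NOUN]"):
           ("DET", "ADJ", "ADJ", "NOUN")}
N_TABLE = {("[DET]", "[ADJ,NOUN]", "[ADJ,NOUN]", "[NOUN]"):
           ("DET", "ADJ", "NOUN", "NOUN"),
           ("[DET]", "[ADJ,NOUN]", "[NOUN]"):
           ("DET", "ADJ", "NOUN")}

COMPLETED = complete(S_TABLE, N_TABLE)
print(len(COMPLETED))   # 2
```

Where both tables know a subsequence, the s-type result wins; the n-type result is used only as a fallback, which matches the behaviour described for the completed transducer.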


E. b-Type Transducers 

This section presents a method, according to another embodiment of the invention, that
approximates a first order Hidden Markov Model (HMM) by a finite-state transducer (FST),
referred to herein as b-type approximation. Regular expression operators used in this section are
explained in the annex.

E.1 Basic Idea

Tagging of a sentence based on a (first order) HMM includes finding the most probable
tag sequence given the class sequence of the sentence.

In this approach, an ambiguity class is disambiguated with respect to a context. A
context consists of a sequence of ambiguity classes limited at both ends by some selected
tag. For the left context of length $\beta$ we use the term look-back, and for the right context of
length $\alpha$ we use the term look-ahead.

In Fig. 6, the tag $t_i^2$ can be selected from the class $c_i$ because it is between two selected
tags, which are $t_{i-2}^1$ at a look-back distance of $\beta = 2$ and $t_{i+2}^2$ at a look-ahead distance of $\alpha = 2$.
Actually, the two selected tags $t_{i-2}^1$ and $t_{i+2}^2$ allow not only the disambiguation of the class $c_i$ but
of all classes in between, i.e. $c_{i-1}$, $c_i$ and $c_{i+1}$.

We approximate the tagging of a whole sentence by tagging sub-sequences with selected
tags at both ends (Fig. 6), and then overlapping them. The most probable paths in the tag space
of a sentence, i.e. valid paths according to this approach, can be found as sketched in Fig. 7. An
ordered set of overlapping sequences where every sequence is shifted by one tag to the right with
respect to the previous sequence constitutes a valid path. There can be more than one valid path
in the tag space of a sentence (Fig. 7). Sets of sequences that do not overlap in such a way are
incompatible according to this model, and do not constitute valid paths.

E.2 b-Type Sequences

Given a length $\beta$ of look-back and a length $\alpha$ of look-ahead, we generate for every class
$c_0$, every look-back sequence $t_{-\beta}\; c_{-\beta+1} \ldots c_{-1}$, and every look-ahead sequence $c_1 \ldots c_{\alpha-1}\; t_{\alpha}$, a
b-type sequence:

$$t_{-\beta}\;\; c_{-\beta+1} \ldots c_{-1}\;\; c_0\;\; c_1 \ldots c_{\alpha-1}\;\; t_{\alpha}$$

For example:

    CONJ [DET,PRON] [ADJ,NOUN,VERB] [NOUN,VERB] VERB

Each such original b-type sequence (Fig. 8) is disambiguated based on a first order HMM.
Here we use the Viterbi algorithm (Viterbi, 1967; Rabiner, 1990) for efficiency.

The algorithm will be explained for a first order HMM. In the case of a second order HMM,
b-type sequences must begin and end with two selected tags rather than one.
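
The enumeration of original b-type sequences for given look-back and look-ahead lengths can be sketched with `itertools.product`. This Python fragment is a minimal illustration under stated assumptions: `original_b_type_sequences`, `TAGS` and `CLASS_NAMES` are hypothetical names with invented contents, and the shorter sequences limited by a sentence boundary are left out for brevity.

```python
from itertools import product

def original_b_type_sequences(tags, classes, beta, alpha):
    """Yield (look_back, c0, look_ahead) triples, ignoring sentence boundaries."""
    def contexts(n, tag_first):
        if n == 0:
            return [()]                    # no look-back or no look-ahead
        out = []
        for t in tags:
            for cs in product(classes, repeat=n - 1):
                # The look-back starts with a tag, the look-ahead ends with one.
                out.append((t,) + cs if tag_first else cs + (t,))
        return out

    for back in contexts(beta, True):
        for c0 in classes:
            for ahead in contexts(alpha, False):
                yield back, c0, ahead

TAGS = ["ADJ", "CONJ", "VERB"]
CLASS_NAMES = ["[CONJ]", "[DET,PRON]", "[ADJ,NOUN,VERB]", "[NOUN,VERB]"]
sequences = list(original_b_type_sequences(TAGS, CLASS_NAMES, beta=2, alpha=2))
print(len(sequences))   # 3*4 look-backs x 4 classes x 4*3 look-aheads = 576
```

The count grows as the product of the tag set size and powers of the class set size, which gives a feeling for why long look-back and look-ahead quickly lead to large transducers.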

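For an original b-type sequence, the joint probability of its class sequence $C$ with its tag
sequence $T$ (Fig. 8) can be estimated by:

$$p(C,T) = p(c_{-\beta+1} \ldots c_{\alpha-1},\; t_{-\beta} \ldots t_{\alpha}) = \left[ \prod_{i=-\beta+1}^{\alpha-1} a(t_i \mid t_{i-1}) \cdot b(c_i \mid t_i) \right] \cdot a(t_{\alpha} \mid t_{\alpha-1})$$

A boundary, i.e. a sentence beginning or end, may occur at any position in the look-back
sequence or in the look-ahead sequence. No look-back ($\beta = 0$) or no look-ahead ($\alpha = 0$) is also
allowed. The above probability estimation can then be expressed more generally (Fig. 8) as:

$$p(C,T) = p_{start} \cdot p_{middle} \cdot p_{end}$$

with $p_{start}$ being

$$p_{start} = \begin{cases} a(t_{-\beta+1} \mid t_{-\beta}) & \text{for fixed tag } t_{-\beta} \\ \pi(t_{-\beta+1}) & \text{for sentence beginning } \# \\ 1 & \text{for } \beta = 0 \text{, i.e. no look-back} \end{cases}$$

with $p_{middle}$ being

$$p_{middle} = \begin{cases} b(c_{-\beta+1} \mid t_{-\beta+1}) \cdot \prod_{i=-\beta+2}^{\alpha-1} a(t_i \mid t_{i-1}) \cdot b(c_i \mid t_i) & \text{for } \beta + \alpha > 0 \\ b(c_0 \mid t_0) & \text{for } \beta + \alpha = 0 \end{cases}$$

and with $p_{end}$ being

$$p_{end} = \begin{cases} a(t_{\alpha} \mid t_{\alpha-1}) & \text{for fixed tag } t_{\alpha} \\ 1 & \text{for sentence end } \# \text{ or } \alpha = 0 \text{, i.e. no look-ahead} \end{cases}$$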

When the most likely tag sequence is found for an original b-type sequence, the class $c_0$
in the middle position is associated with its most likely tag $t_0$. We formulate constraints for the
other tags $t_{-\beta}$ and $t_{\alpha}$ and classes $c_{-\beta+1} \ldots c_{-1}$ and $c_1 \ldots c_{\alpha-1}$ of the original b-type sequence. Thus
we obtain a tagged b-type sequence:

$$t_{-\beta}^{B\beta}\;\; c_{-\beta+1}^{B(\beta-1)} \ldots c_{-1}^{B1}\;\; c_0{:}t_0\;\; c_1^{A1} \ldots c_{\alpha-1}^{A(\alpha-1)}\;\; t_{\alpha}^{A\alpha}$$

stating that $t_0$ is the most probable tag in the class $c_0$ if it is preceded by $t_{-\beta}^{B\beta} \ldots c_{-1}^{B1}$ and
followed by $c_1^{A1} \ldots t_{\alpha}^{A\alpha}$.

In the example:

    CONJ-B2 [DET,PRON]-B1 [ADJ,NOUN,VERB]:ADJ [NOUN,VERB]-A1 VERB-A2

ADJ is the most likely tag in the class [ADJ,NOUN,VERB] if it is preceded by the tag CONJ two
positions back (B2), by the class [DET,PRON] one position back (B1), and followed by the class
[NOUN,VERB] one position ahead (A1) and by the tag VERB two positions ahead (A2).

Boundaries are denoted by a particular symbol, #, and can occur anywhere in the look-back and
look-ahead. For example:

    #-B2 [DET,PRON]-B1 [ADJ,NOUN,VERB]:ADJ [NOUN,VERB]-A1 VERB-A2
    CONJ-B2 [DET,PRON]-B1 [ADJ,NOUN,VERB]:NOUN #-A1

Note that look-back length $\beta$ and look-ahead length $\alpha$ also include all sequences shorter
than $\beta$ or $\alpha$, respectively, that are limited by a boundary #.
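
The notation of a tagged b-type sequence, with B-suffixes counting positions back and A-suffixes counting positions ahead, can be produced mechanically. The Python sketch below is only an illustration: `encode_tagged_b_type` is a hypothetical helper, and representing each symbol as a string with a hyphenated suffix is an assumption that simply follows the written form of the examples.

```python
def encode_tagged_b_type(look_back, c0, t0, look_ahead):
    """Render a tagged b-type sequence as a list of symbol strings."""
    beta = len(look_back)
    # Look-back symbols are numbered from the far end: B<beta> ... B1.
    symbols = [f"{x}-B{beta - i}" for i, x in enumerate(look_back)]
    symbols.append(f"{c0}:{t0}")                 # middle class with its chosen tag
    # Look-ahead symbols are numbered outwards: A1 ... A<alpha>.
    symbols += [f"{x}-A{i + 1}" for i, x in enumerate(look_ahead)]
    return symbols

full = encode_tagged_b_type(("CONJ", "[DET,PRON]"), "[ADJ,NOUN,VERB]", "ADJ",
                            ("[NOUN,VERB]", "VERB"))
short = encode_tagged_b_type(("CONJ", "[DET,PRON]"), "[ADJ,NOUN,VERB]", "NOUN",
                             ("#",))
print(" ".join(full))
print(" ".join(short))
```

The second call shows a look-ahead cut short by a sentence boundary, which yields a sequence ending in `#-A1`.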

Figure 9 is a schematic diagram illustrating the steps in the procedure of building a b-type
transducer, according to an embodiment of the invention.

For a given length $\beta$ of look-back and a length $\alpha$ of look-ahead, and using the set of tags
and the set of classes, every possible original b-type sequence is generated (step s90). Then,
these are disambiguated statistically, and each is encoded in a tagged b-type sequence $B_i$ as described
above (Viterbi; step s92). All sequences $B_i$ are then unioned and a preliminary tagger model $B'$ is
generated (step s94):

$$B' = \left[\, \bigcup_i B_i \,\right]^{*}$$

where all sequences $B_i$ can occur in any order and number (including zero times) because no
constraints have yet been applied.

Concatenation Constraints 

To ensure a correct concatenation of sequences, it is necessary to make sure that every
sequence $B_i$ is preceded and followed by other $B_i$ according to what is encoded in the look-back
and look-ahead sequences which were explained above.

Constraints are created for preceding and following tags, classes and sentence
boundaries. For the look-back, a particular tag $t_i$ or class $c_j$ is requested at a particular distance
$\delta \le -1$ by a constraint $R^{\delta}(t_i)$ or $R^{\delta}(c_j)$ respectively, with ${}^{\cup}t$ and ${}^{\cup}c$ being the union of all tags
and all classes respectively. A sentence beginning, #, is requested at a particular look-back distance
$\delta \le -1$ by a constraint $R^{\delta}(\#)$.

In the case of look-ahead, for a particular distance $\delta \ge 1$, a particular tag $t_i$ or class $c_j$ or a
sentence end # is required in a similar way, by:

$$R^{\delta}(c_j) = \sim\!\left[\; ?^{*}\;\; c_j^{A\delta}\;\; \sim\!\left[\; [\backslash {}^{\cup}c]^{*}\;\; \left[\, {}^{\cup}c\;\; [\backslash {}^{\cup}c]^{*} \,\right]^{\wedge(\delta-1)}\;\; c_j\;\; ?^{*} \;\right] \right]$$

for $\delta \ge 1$.

We create the intersection $R_t$ of all tag constraints $R^{\delta}(t_i)$, the intersection $R_c$ of all class
constraints $R^{\delta}(c_j)$, and the intersection $R_{\#}$ of all sentence boundary constraints $R^{\delta}(\#)$:

$$R_t = \bigcap_{\delta,\,i} R^{\delta}(t_i) \qquad R_c = \bigcap_{\delta,\,j} R^{\delta}(c_j) \qquad R_{\#} = \bigcap_{\delta} R^{\delta}(\#)$$

All constraints (for tags, classes and sentence boundaries) are enforced (step s96 in Fig.
9) by composition with the preliminary tagger model $B'$. The class constraint $R_c$ must be
composed on the upper side of $B'$, which is the side of the classes, and both the tag constraint $R_t$
and the boundary constraint $R_{\#}$ must be composed on the lower side of $B'$, which is the side of the
tags:

$$B'' = R_c \;.o.\; B' \;.o.\; R_t \;.o.\; R_{\#}$$

Having ensured correct concatenation, all symbols that have served to constrain tags,
classes or boundaries are deleted (from $B''$) (step s98). Finally, the FST is determinized and
minimized.

The above-mentioned mathematical model is referred to as a b-type model, the
corresponding FST as a b-type transducer, and the whole algorithm leading from the HMM to the
transducer, as a b-type approximation of an HMM.

E.4 Properties of b-Type Transducers

There are two groups of b-type transducers with different properties: FSTs without
look-back or without look-ahead, and FSTs with both look-back and look-ahead. Both accept any
sequence of ambiguity classes.

b-Type FSTs without look-back or without look-ahead are always sequential. They always map a
class sequence that corresponds to the word sequence of a sentence to exactly one
tag sequence. Their tagging accuracy and similarity with the underlying HMM increase
with growing look-back or look-ahead. A b-type FST with look-back $\beta = 1$ and without
look-ahead ($\alpha = 0$) is equivalent to an n1-type FST (section C).

b-Type transducers with both look-back and look-ahead are in general not sequential. For
a class sequence corresponding to the word sequence of a sentence, they deliver a set of
alternative tag sequences, which means that the tagging results are ambiguous. This set is never
empty, and the most probable tag sequence according to the underlying HMM is always in this
set. The longer the look-back distance $\beta$ and the look-ahead distance $\alpha$ are, the larger the
FST and the smaller the set of alternative tagging results. For sufficiently large look-back
plus look-ahead, this set may always contain only one tagging result. In this case the
b-type FST is equivalent to the underlying HMM. For reasons of size, however, this FST is only
computable for HMMs with small tag sets.

F. An Implemented Finite State Tagger

The implemented tagger requires three transducers which represent a lexicon, a guesser
and an approximation of an HMM. Figure 10 is an illustration of the tagging of a sentence using
either the n-type transducer of Fig. 3, the s-type transducer formed in Fig. 5 or the b-type
transducer formed in Fig. 9. Figure 11 is a schematic flow chart of the steps involved in the
procedure, in accordance with an embodiment of the invention, of tagging a text corpus 20 with a
finite state tagger using an HMM transducer.

Every word 21 of an input sentence is read (step s71) from the corpus and is initially
looked for in the lexicon; if this fails, the search continues in the guesser (step s72), resulting
in a word labelled with a class (22). As soon as an input token gets labelled with the sentence end
class (e.g. [SENT] in Fig. 10), the tagger stops (step s73) reading words from the input. At that
point the tagger has read and stored the words of a whole sentence (Fig. 6, col. 1) and generated
the corresponding sequence of classes 23 (see Fig. 10, col. 2).

The class sequence 23 is now deterministically mapped (step s74) to a tag sequence 24
(see Fig. 10, col. 3) by means of the HMM transducer.

The tagger outputs (step s75) the stored word and tag sequence 24 of the sentence and
continues (step s76) the same way with the remaining sentences of the corpus, stopping (step
s77) when the end of the corpus is reached. The end result is the tagged text corpus 25.
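The control flow of steps s71 to s77 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the lexicon is a plain dictionary, the guesser and the HMM transducer are ordinary functions standing in for the three transducers, and all names (`tag_corpus`, `toy_guesser`, `toy_transducer`) and the toy data are hypothetical. Flushing a trailing sentence that lacks a sentence-end token is likewise an assumption of the sketch.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

SENT_END = "[SENT]"  # sentence-end class


def tag_corpus(
    words: Iterable[str],
    lexicon: Dict[str, str],
    guesser: Callable[[str], str],
    hmm_transducer: Callable[[List[str]], List[str]],
) -> Iterator[List[Tuple[str, str]]]:
    """Yield one tagged sentence (list of word/tag pairs) at a time."""
    sent_words: List[str] = []
    sent_classes: List[str] = []
    for word in words:                              # step s71: read a word
        cls = lexicon.get(word)                     # step s72: lexicon first ...
        if cls is None:
            cls = guesser(word)                     # ... then the guesser
        sent_words.append(word)
        sent_classes.append(cls)
        if cls == SENT_END:                         # step s73: sentence complete
            tags = hmm_transducer(sent_classes)     # step s74: classes -> tags
            yield list(zip(sent_words, tags))       # step s75: output
            sent_words, sent_classes = [], []       # step s76: next sentence
    if sent_words:                                  # assumption: flush a trailing
        tags = hmm_transducer(sent_classes)         # sentence without end token
        yield list(zip(sent_words, tags))
    # step s77: the end of the corpus has been reached


# Hypothetical stand-ins for the three transducers.
toy_lexicon = {"the": "[DET]", "dog": "[NOUN_VERB]", ".": SENT_END}


def toy_guesser(word: str) -> str:
    return "[NOUN]"


def toy_transducer(classes: List[str]) -> List[str]:
    # Picks the first tag named in each class; a real HMM transducer is context-sensitive.
    return [c.strip("[]").split("_")[0] for c in classes]


for sentence in tag_corpus(["the", "dog", "barks", "."], toy_lexicon, toy_guesser, toy_transducer):
    print(sentence)
```

The generator keeps only one sentence in memory at a time, matching the description that the tagger stores the words of a whole sentence before mapping its class sequence.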

G. Tests and Results 

Table 1 compares an n-type and an s-type transducer with the underlying HMM on an
English test case. As expected, the transducers perform tagging faster than the HMM.

                     tagging     tagging speed on different    transducer size
                     accuracy    computers in words/sec
                     in %        on ULTRA2     on SPARC20      num. states   num. arcs

HMM                  96.77        4 590         1 564          none          none
s-type transducer    95.05       12 038         5 453          4 709         976 785
n-type transducer    94.19       17 244         8 180             71          21 087

Language:              English
HMM training corpus:   19 944 words
Test corpus:           19 934 words
Tag set:               74 tags, 297 classes
s-Type transducer:     with subsequences from a training corpus of 100,000 words,
                       completed with subsequences from an n-type transducer

Table 1: Accuracy, speed, size and creation time of HMM transducers



Since both transducers are approximations of HMMs, they show a slightly lower tagging
accuracy than the HMMs. However, improvement in accuracy can be expected since these
transducers can be composed with transducers encoding correction rules for frequent errors, as
described above in the introductory part of this disclosure.

The s-type transducer is more accurate in tagging but also larger and slower than the
n-type transducer.
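To make the trade-off in Table 1 concrete, the following short Python calculation derives the speed-up factor over the HMM and the loss in accuracy for each transducer. The figures are copied from Table 1; the function names are illustrative only.

```python
# Figures from Table 1 (tagging speed in words/sec, accuracy in %).
SPEED = {
    "HMM": {"ULTRA2": 4590, "SPARC20": 1564},
    "s-type": {"ULTRA2": 12038, "SPARC20": 5453},
    "n-type": {"ULTRA2": 17244, "SPARC20": 8180},
}
ACCURACY = {"HMM": 96.77, "s-type": 95.05, "n-type": 94.19}


def speedup(model: str, machine: str) -> float:
    """Tagging speed of `model` relative to the HMM on the same machine."""
    return SPEED[model][machine] / SPEED["HMM"][machine]


def accuracy_loss(model: str) -> float:
    """Accuracy difference to the HMM, in percentage points."""
    return ACCURACY["HMM"] - ACCURACY[model]


for model in ("s-type", "n-type"):
    print(
        model,
        f"ULTRA2 x{speedup(model, 'ULTRA2'):.2f}",
        f"SPARC20 x{speedup(model, 'SPARC20'):.2f}",
        f"accuracy -{accuracy_loss(model):.2f} points",
    )
```

On these figures the n-type transducer is roughly 3.8 times faster than the HMM on the ULTRA2 at a cost of about 2.6 percentage points, while the s-type transducer is roughly 2.6 times faster at a cost of about 1.7 points.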

Table 2 compares the tagging accuracy of n-type and s-type transducers and the
underlying HMM for different languages.

Tagging accuracy in %

                     English   Dutch    French   German   Portug.   Spanish

HMM                  96.77     94.76    98.65    97.62    97.12     97.60
s-type transducer    95.05     92.36    98.37    95.81    96.56     96.87
n-type transducer    94.19     91.58    98.18    94.49    96.19     96.46

s-Type transducer:   with subsequences from a training corpus of 100,000 words,
                     completed with subsequences from an n-type transducer

Table 2: Accuracy of HMM transducers for different languages

For the test of b-type transducers we used an English corpus, lexicon and guesser that
originally were annotated with 74 different tags. We automatically recoded the tags in order to
reduce their number, i.e. in some cases more than one of the original tags were recoded into one
and the same new tag. We applied different recodings, thus obtaining English corpora, lexicons
and guessers with reduced tag sets of 45, 36, 27, 18 and 9 tags respectively.
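A many-to-one recoding of this kind can be sketched in Python as below. The tag names, the recoding table and the toy lexicon are hypothetical and serve only to show the mechanism: mapping several original tags onto one new tag also merges ambiguity classes, which is one way the number of classes can shrink together with the number of tags.

```python
from typing import Dict, FrozenSet

# Hypothetical recoding table: several original tags map to the same new tag.
RECODING: Dict[str, str] = {
    "NOUN_SG": "NOUN",
    "NOUN_PL": "NOUN",
    "VERB_PRES": "VERB",
    "VERB_PAST": "VERB",
}


def recode_class(ambiguity_class: FrozenSet[str]) -> FrozenSet[str]:
    """Recode every tag of an ambiguity class; duplicates collapse automatically."""
    return frozenset(RECODING[tag] for tag in ambiguity_class)


def recode_lexicon(lexicon: Dict[str, FrozenSet[str]]) -> Dict[str, FrozenSet[str]]:
    """Apply the recoding to the ambiguity class of every lexicon entry."""
    return {word: recode_class(cls) for word, cls in lexicon.items()}


# Hypothetical toy lexicon mapping words to ambiguity classes.
lexicon = {
    "walks": frozenset({"NOUN_PL", "VERB_PRES"}),
    "walk": frozenset({"NOUN_SG", "VERB_PRES"}),
    "walked": frozenset({"VERB_PAST"}),
    "dogs": frozenset({"NOUN_PL"}),
    "dog": frozenset({"NOUN_SG"}),
}
recoded = recode_lexicon(lexicon)

print(len(set(lexicon.values())), "classes before recoding")
print(len(set(recoded.values())), "classes after recoding")
```

In this toy case four tags become two and five distinct ambiguity classes become three.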



Transducer          Accuracy     Tagging speed              Transducer size            Creation
or HMM              test corp.   in words/second                                       time
                    in %         on Ultra2    on Sparc20    num. of      num. of       on Ultra2
                                                            states       arcs

HMM                  97.07        3 358        1 363
b-FST (β=0,α=0)      94.47       25 521       11 815             1           119        3 sec
b-FST (β=1,α=0)      95.60       25 521       12 038            28         3 332        4 sec
b-FST (β=2,α=0)      95.71       25 521       11 193         1 642       195 398       32 min
b-FST (β=0,α=1)      95.26       17 244        9 969           137        14 074        5 sec
b-FST (β=0,α=2)      95.37       19 939        9 969         3 685       280 545        3 min
b-FST (β=1,α=1)     *96.78       16 790        8 986         1 748       192 275       19 sec
b-FST (β=2,α=1)     *97.06       22 787       11 000        19 878     1 911 277       61 min

Language:    English
Corpora:     19 944 words for HMM training, 19 934 words for test
Tag set:     27 tags, 119 classes
*            Multiple, i.e. ambiguous tagging results: only first result retained

Types of FST (Finite State Transducers):
b (β=2,α=1)  b-type transducer with look-back of 2 and look-ahead of 1

Computers:
Ultra2       1 CPU, 512 MBytes physical RAM, 1.4 GBytes virtual RAM
Sparc20      1 CPU, 192 MBytes physical RAM, 827 MBytes virtual RAM

Table 3: Accuracy, tagging speed and size of some HMM transducers

Table 3 compares b-type transducers with different lengths of look-back and look-ahead
for a tag set of 27 tags. The highest accuracy (97.06 %) could be obtained with a b-type FST
with β=2 and α=1. This b-type FST produced ambiguous tagging results in some cases. Then
only the first result found was retained.



Transducer          Tagging accuracy with tag sets of different sizes (in %)
or HMM
                    74 tags    45 tags    36 tags    27 tags    18 tags    9 tags
                    297 cls.   214 cls.   181 cls.   119 cls.   97 cls.    67 cls.

HMM                  96.78      96.92      97.35      97.07      96.73      95.76
b-FST (β=0,α=0)      83.53      83.71      87.21      94.47      94.24      93.86
b-FST (β=1,α=0)      94.19      94.09      95.16      95.60      95.17      94.14
b-FST (β=2,α=0)        -        94.28      95.32      95.71      95.31      94.22
b-FST (β=0,α=1)      9?.79      92.47      93.69      95.26      95.19      94.64
b-FST (β=0,α=2)      93.46      92.77      93.92      95.37      95.30      94.80
b-FST (β=1,α=1)     *94.94     *95.14     *95.78     *96.78     *96.59     *95.36
b-FST (β=2,α=1)        -          -       *97.34     *97.06     *96.73     *95.73
b-FST (β=3,α=1)        -          -          -          -          -        95.76

Language:    English
Corpora:     19 944 words for HMM training, 19 934 words for test
Types of FST (Finite State Transducers): see Table 3
*            Multiple, i.e. ambiguous tagging results: only first result retained
-            Transducer could not be computed for reasons of size

Table 4: Tagging accuracy with tag sets of different sizes



Table 4 shows the tagging accuracy of different b-type transducers with tag sets of
different sizes. To get results that are almost equivalent to those of an HMM, a b-type FST
needs at least a look-back of β=2 and a look-ahead of α=1 or vice versa. For reasons of
size, this kind of FST could only be computed for the tag sets with 36 tags or less. A
b-type FST with β=3 and α=1 could only be computed for the tag set with 9 tags. This FST
gave exactly the same tagging results as the underlying HMM.
Annex: Regular Expression Operators of the XEROX Finite State Calculus

Below, a and b designate symbols, A and B designate languages, and R and Q designate
relations between two languages. More details on the following operators and pointers to
finite-state literature can be found in http://www.rxrc.xerox.com/research/mltt/fst

$A         Contains. Set of strings containing at least one occurrence of a string from A as a
           substring.
~A         Complement (negation). All strings except those from A.
\a         Term complement. Any symbol other than a.
A*         Kleene star. Zero or more times A concatenated with itself.
A^n        A n times. Language A concatenated n times with itself.
A+         Kleene plus. One or more times A concatenated with itself.
a -> b     Replace. Relation where every a on the upper side gets mapped to a b on the lower
           side.
a <- b     Inverse replace. Relation where every b on the lower side gets mapped to an a on
           the upper side.
a:b        Symbol pair with a on the upper and b on the lower side.
R.u        Upper language of the relation R.
R.l        Lower language of the relation R.
R.i        Inverse relation where the upper and the lower language are exchanged with
           respect to R.
A B        Concatenation of all strings of A with all strings of B.
A | B      Union of the languages A and B.
A & B      Intersection of the languages A and B.
A - B      Relative complement (minus). All strings of the language A that are not in B.
A .x. B    Cross product (Cartesian product) of the languages A and B.
R .o. Q    Composition of the relations R and Q.
0 or []    Empty string (epsilon).
?          Any symbol in the known alphabet and its extensions.
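For readers less familiar with these operators, the following Python sketch models a few of them on small finite sets: a language is a set of strings and a relation is a set of (upper, lower) string pairs. This is a simplification made for illustration only. Real finite-state calculi operate on automata and can represent infinite languages, which is why the Kleene star is replaced here by a bounded power. All function names are hypothetical.

```python
from typing import Set, Tuple

Language = Set[str]
Relation = Set[Tuple[str, str]]  # finite set of (upper, lower) string pairs


def upper(r: Relation) -> Language:
    """R.u: upper language of the relation."""
    return {u for u, _ in r}


def lower(r: Relation) -> Language:
    """R.l: lower language of the relation."""
    return {l for _, l in r}


def inverse(r: Relation) -> Relation:
    """R.i: exchange upper and lower side."""
    return {(l, u) for u, l in r}


def cross(a: Language, b: Language) -> Relation:
    """A .x. B: every string of A paired with every string of B."""
    return {(x, y) for x in a for y in b}


def identity(a: Language) -> Relation:
    """A language viewed as a relation mapping each string to itself."""
    return {(x, x) for x in a}


def compose(r: Relation, q: Relation) -> Relation:
    """R .o. Q: pairs (u, l) such that some m has (u, m) in R and (m, l) in Q."""
    return {(u, l) for (u, m) in r for (m2, l) in q if m == m2}


def power(a: Language, n: int) -> Language:
    """A^n: A concatenated n times with itself (A^0 is the empty string)."""
    result: Language = {""}
    for _ in range(n):
        result = {x + y for x in result for y in a}
    return result


R = {("ab", "xy"), ("c", "z")}
Q = {("xy", "1"), ("q", "2")}
print(compose(R, Q))                 # {('ab', '1')}
print(compose(identity({"ab"}), R))  # {('ab', 'xy')}
```

The last line shows how composing a language, taken as an identity relation, on the upper side of a relation filters out the pairs whose upper string is not in that language, which is the way a constraint restricts a transducer.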
CLAIMS:

1. A method carried out in a text processing system, for generating a text tagging finite state
transducer (FST) for a predetermined language, the finite state transducer encoding along a
plurality of arcs sets of ordered pairs of upper and lower labels, comprising:

(a) providing a set of tags, the or each tag being a part-of-speech designator for the
language, and a set of classes, the or each class being an ambiguity class, defining
groups of possible tags, for words in the language,

(b) providing a plurality of states, including an initial state and, for each class-tag pair, a
further state labelled as the respective class-tag pair, and

(c) for each state, for each class, creating an arc between the state and a destination
state labelled with a class-tag pair, the class-tag pair being the most probable for
the class, the arc so added having the class as its upper label and a tag of that
class as its lower label.

2. The method of claim 1, wherein (c) comprises, where the state is the initial state, (c1)
determining the most probable class-tag pair using

$$\arg\max_k \; p_1(c_i, t_{ik}) = \pi(t_{ik}) \cdot b(c_i \mid t_{ik})$$

where π is initial probability, and b is class probability.

3. The method of claim 1 or 2, wherein (c) comprises, where the state is not the initial state,
(c1') determining the most probable class-tag pair using

$$\arg\max_k \; p_2(c_i, t_{ik}) = a(t_{ik} \mid t_{previous}) \cdot b(c_i \mid t_{ik})$$

where a is transition probability, and b is class probability.

4. The method of claim 1, 2 or 3, further comprising minimising and determinising the FST
generated in steps (a)-(c).

5. A method carried out in a text processing system, for generating a text tagging finite state
transducer (FST) for a predetermined language, the finite state transducer encoding along a
plurality of arcs sets of ordered pairs of upper and lower labels, comprising:

(a) providing an incomplete set of class sequences, the class sequences being
sequences of classes, the or each class being an ambiguity class, defining groups
of possible tags, for words in the language,

(b) tagging the class sequences using a Hidden Markov Model, to produce an
incomplete set of class and tag sequences of a first type,

(c) providing a complete set of class and tag sequences of a second type,

(d) combining the set of class and tag sequences of a first type with the set of class
and tag sequences of a second type, and

(e) building a FST using the combination obtained in step (d).

6. The method of claim 5, wherein step (a) comprises:
(a1) providing a set of tags and a set of classes, and
(a2) creating from said set of tags and set of classes, all possible class sequences up to
a predetermined length.

7. The method of claim 5, wherein step (a) comprises:
(a1') providing an untagged text corpus, and
(a2') extracting all class sequences from the untagged text corpus.

8. The method of claim 5, 6 or 7, wherein step (c) comprises:
(c1) providing a FST generated according to the method of any of claims 1 to 4,
(c2) extracting all class sequences from the FST provided in step (c1).

9. A method carried out in a text processing system, for generating a text tagging finite-state
transducer (FST) for a predetermined language, the FST encoding along a plurality of arcs sets of
ordered pairs of upper and lower labels, comprising:

(a) providing a set of tags, the or each tag being a part-of-speech designator for the
language, and a set of classes, the or each class being an ambiguity class, defining
groups of possible tags, for the language,

(b) creating from the set of tags and the set of classes a set of subsequences of
ambiguity classes, said set of subsequences comprising all possible subsequences
for a predetermined look-back value and look-ahead value,

(c) using a Hidden Markov Model, tagging said set of subsequences to generate a set
of tagged subsequences,

(d) performing a union operation, followed by a Kleene star operation, on the set of
tagged subsequences, to generate a preliminary tagger model, said preliminary
tagger model being a sequence in which the tagged subsequences may appear in
any order or in any number, and

(e) composing a plurality of constraints with the preliminary tagger model to generate
the tagging FST, said constraints including constraints for tags, classes and
sentence boundaries, and deleting all symbols expressing constraints.

10. A method carried out in a text processing system, for tagging an untagged text corpus,
comprising:

(a) reading, in sequence, the or each word of the text corpus,

(b) for the or each word, looking up the word using a lexical resource, to obtain the
word labelled with its class,

(c) if a sentence end token has not been read, repeating steps (a) and (b), the
sentence end token defining the end of the current sentence,

(d) if a sentence end token has been read, applying the FST obtainable by the method
of claims 1, 5 or 9 to the class sequence corresponding to the current sentence, to
generate a tag sequence, the tag sequence comprising a sequence of tags,

(e) outputting a tagged form of the current sentence, the tagged form comprising the or
each word of the current sentence and, appended thereto, a respective tag from
said tag sequence, and

(f) if the end of the text corpus has not been reached, repeating steps (a)-(e).

11. A text processing system when suitably programmed for carrying out the method of any of
the preceding claims, the system comprising a processor and memory, the processor being
operable with said memory for executing instructions corresponding to steps of any of said
methods.

12. A recordable medium having recorded thereon digital data defining instructions for
execution by a processor and corresponding to the steps of the methods of any of claims 1 to 10.
